Applying polygenic risk score methods to pharmacogenomics GWAS: challenges and opportunities

Abstract Polygenic risk scores (PRSs) have emerged as promising tools for the prediction of human diseases and complex traits in disease genome-wide association studies (GWAS). Applying PRSs to pharmacogenomics (PGx) studies has begun to show great potential for improving patient stratification and drug response prediction. However, there are unique challenges that arise when applying PRSs to PGx GWAS beyond those typically encountered in disease GWAS (e.g. Eurocentric or trans-ethnic bias). These challenges include: (i) the lack of knowledge about whether PGx or disease GWAS/variants should be used in the base cohort (BC); (ii) the small sample sizes in PGx GWAS with corresponding low power and (iii) the more complex PRS statistical modeling required for handling both prognostic and predictive effects simultaneously. To gain insights in this landscape about the general trends, challenges and possible solutions, we first conduct a systematic review of both PRS applications and PRS method development in PGx GWAS. To further address the challenges, we propose (i) a novel PRS application strategy by leveraging both PGx and disease GWAS summary statistics in the BC for PRS construction and (ii) a new Bayesian method (PRS-PGx-Bayesx) to reduce Eurocentric or cross-population PRS prediction bias. Extensive simulations are conducted to demonstrate their advantages over existing PRS methods applied in PGx GWAS. Our systematic review and methodology research work not only highlights current gaps and key considerations while applying PRS methods to PGx GWAS, but also provides possible solutions for better PGx PRS applications and future research.

population from 1000 Genomes (1000G) Phase 3 European samples (n = 503).We used the HapMap3 CEU sample (n = 234) as an external LD reference panel.22,630 SNPs on chromosome 19 were left after matching between the 1000G and the HapMap3 datasets.

A.2 Simulate prognostic and predictive effect sizes
The SNP prognostic and predictive effect sizes were simulated jointly with the following distribution: with probability π k , 0 with probability 1 − π k .
where π k ∼ Beta(p, 1 − p), p denoted the proportion of causal SNPs, j was the index of j-th SNP, and k was the index of k-th LD block [1].Furthermore, , where ρ k ∼ Uniform(0,1).
ψ/ξ = 1 implied that the prognostic and predictive effects were in the same scale, while ψ/ξ ≠ 1 implied that one effect was dominant to another effect.It was worth noting that the above simulation indicates each causal SNP carried some degree of prognostic effect and some degree of predictive effect.To generate completely separated prognostic and predictive SNPs, we randomly chose half of the causal variants and only kept their prognostic effects (i.e., artificially shrank α j = 0); for the rest half of the causal variants, only predictive effects were kept (i.e., artificially shrank β j = 0).

A.3 Simulate disease and PGx GWAS summary statistics in the base cohort
To simulate disease GWAS data in the base cohort, the phenotype was generated as: where n = 50,000 and m = 22,630.Then we calculated disease GWAS summary statistics with the simulated individual-level data.To simulate PGx GWAS data in the base cohort, the phenotype was generated as: where n = 1,000, 5,000, or 10,000 and m = 22,630.Then we calculated PGx GWAS summary statistics with the simulated individual-level data.

A.4 Simulate PGx GWAS data in the target cohort
The data generation process described in A.3 (Equation 1) was repeated to generate 5,000 individual-level PGx GWAS data with two arms (1-to-1 ratio) from randomized clinical trial (RCT).

A.5 Simulate PGx GWAS data for validation
The data generation process described in A.3 (Equation 1) was repeated to generate 1,000 individual-level PGx GWAS data with two arms (1-to-1 ratio) from randomized clinical trial (RCT).This dataset was used as the validation data for selecting the optimal tuning parameters.
Assume 1) Y k is standardized so that Y k ′ Y k /N k = 1; 2) the prognostic and predictive effect sizes from PGx GWAS .
Now we can already see that , where iG denotes the inverse Gamma distribution.
Notice that

B.2 Posterior distribution of 𝐌 𝐣
The posterior distribution of M j borrows the information across K studies: . Therefore

C. Data generation process for trans-ethnic simulation studies C.1 Simulate genotype data
Simulation studies were performed with simulated genotype data using sim1000G v1.40 software for both EUR and
Here, ρ E measures the correlation between prognostic and predictive effects, ρ P measures the correlation between European and non-European populations, and ρ C measures the correlation between base and target cohorts.

C.3 Simulate disease and PGx GWAS summary statistics in the base cohort
We repeated Supplementary Method A.3 for EUR and EAS populations, respectively.Specifically, in the base cohort, the disease GWAS summary statistics of EUR and EAS populations were calculated based on 50,000 and 10,000 subjects, respectively; the PGx GWAS summary statistics of EUR and EAS populations were calculated based on 5,000 and 1,000 subjects, respectively.

C.4 Simulate PGx GWAS data in the target cohort
We repeated Supplementary Method A.4 for EUR and EAS populations, respectively.Sample sizes of EUR and EAS populations in the target cohort were 5,000 and 1,000, respectively.